AI Safety Institute
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Korbak, Tomek, Balesni, Mikita, Barnes, Elizabeth, Bengio, Yoshua, Benton, Joe, Bloom, Joseph, Chen, Mark, Cooney, Alan, Dafoe, Allan, Dragan, Anca, Emmons, Scott, Evans, Owain, Farhi, David, Greenblatt, Ryan, Hendrycks, Dan, Hobbhahn, Marius, Hubinger, Evan, Irving, Geoffrey, Jenner, Erik, Kokotajlo, Daniel, Krakovna, Victoria, Legg, Shane, Lindner, David, Luan, David, Mądry, Aleksander, Michael, Julian, Nanda, Neel, Orr, Dave, Pachocki, Jakub, Perez, Ethan, Phuong, Mary, Roger, Fabien, Saxe, Joshua, Shlegeris, Buck, Soto, Martín, Steinberger, Eric, Wang, Jasmine, Zaremba, Wojciech, Baker, Bowen, Shah, Rohin, Mikulik, Vlad
AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
- North America > United States (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- Information Technology > Security & Privacy (0.46)
- Government > Military (0.46)
From Turing to Tomorrow: The UK's Approach to AI Regulation
Ritchie, Oliver, Anderljung, Markus, Rachman, Tom
The UK has pursued a distinctive path in AI regulation: less cautious than the EU but more willing to address risks than the US, and has emerged as a global leader in coordinating AI safety efforts. Impressive developments from companies like London-based DeepMind began to spark concerns in the UK about catastrophic risks from around 2012, although regulatory discussion at the time focussed on bias and discrimination. By 2022, these discussions had evolved into a "pro-innovation" strategy, in which the government directed existing regulators to take a light-touch approach, governing AI at point of use, but avoided regulating the technology or infrastructure directly. ChatGPT arrived in late 2022, galvanising concerns that this approach may be insufficient. The UK responded by establishing an AI Safety Institute to monitor risks and hosting the first international AI Safety Summit in 2023, but - unlike the EU - refrained from regulating frontier AI development in addition to its use. A new government elected in 2024 promised to address this gap, but at the time of writing has yet to do so. What should the UK do next? The government faces competing objectives: harnessing AI for economic growth and better public services while mitigating risk. In light of these, we propose a flexible, principles-based regulator to oversee the most advanced AI development and defensive measures against risks from AI-enabled biological design tools, and argue that more technical work is needed to understand how to respond to AI-generated misinformation. We argue for updated legal frameworks on copyright, discrimination, and AI agents, and that regulators will have a limited but important role if AI substantially disrupts labour markets. If the UK gets AI regulation right, it could demonstrate how democratic societies can harness AI's benefits while managing its risks.
- North America > United States > California (0.28)
- Europe > France (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- Law > Statutes (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Government > Regional Government > Europe Government > United Kingdom Government (1.00)
- Banking & Finance > Economy (1.00)
Under Trump, AI Scientists Are Told to Remove 'Ideological Bias' From Powerful Models
The National Institute of Standards and Technology (NIST) has issued new instructions to scientists who partner with the US Artificial Intelligence Safety Institute (AISI) that eliminate mention of "AI safety," "responsible AI," and "AI fairness" from the skills it expects of members and introduce a request to prioritize "reducing ideological bias, to enable human flourishing and economic competitiveness." The information comes as part of an updated cooperative research and development agreement for AI Safety Institute consortium members, sent in early March. Previously, that agreement encouraged researchers to contribute technical work that could help identify and fix discriminatory model behavior related to gender, race, age, or wealth inequality. Such biases are hugely important because they can directly affect end users and disproportionately harm minorities and economically disadvantaged groups. The new agreement removes mention of developing tools "for authenticating content and tracking its provenance" as well as "labeling synthetic content," signaling less interest in tracking misinformation and deep fakes.
WebGames: Challenging General-Purpose Web-Browsing AI Agents
Thomas, George, Chan, Alex J., Kang, Jikun, Wu, Wenqi, Christianos, Filippos, Greenlee, Fraser, Toulis, Andy, Purtorab, Marvin
We introduce WebGames, a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents through a collection of 50+ interactive challenges. These challenges are specifically crafted to be straightforward for humans while systematically testing the limitations of current AI systems across fundamental browser interactions, advanced input processing, cognitive tasks, workflow automation, and interactive entertainment. Our framework eliminates external dependencies through a hermetic testing environment, ensuring reproducible evaluation with verifiable ground-truth solutions. We evaluate leading vision-language models including GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL against human performance. Results reveal a substantial capability gap, with the best AI system achieving only a 43.1% success rate compared to human performance of 95.7%, highlighting fundamental limitations in current AI systems' ability to handle common web interaction patterns that humans find intuitive. The benchmark is publicly available at webgames.convergence.ai, offering a lightweight, client-side implementation that facilitates rapid evaluation cycles. Through its modular architecture and standardized challenge specifications, WebGames provides a robust foundation for measuring progress in the development of more capable web-browsing agents.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
The National Institute of Standards and Technology Braces for Mass Firings
Sweeping layoffs architected by the Trump administration and the so-called Department of Government Efficiency may be coming as soon as this week at the National Institute of Standards and Technology (NIST), a non-regulatory agency responsible for establishing benchmarks that ensure everything from beauty products to quantum computers are safe and reliable. According to several current and former employees at NIST, the agency has been bracing for cuts since President Donald Trump took office last month and ordered billionaire Elon Musk and DOGE to slash spending across the federal government. The fears were heightened last week when some NIST workers witnessed a handful of people they believed to be associated with DOGE inside Building 225, which houses the NIST Information Technology Laboratory at the agency's Gaithersburg, Maryland campus, according to multiple people briefed on the sightings. The DOGE staff were seeking access to NIST's IT systems, one of the people said. Soon after the purported visit, NIST leadership told employees that DOGE staffers were not currently on campus, but that office space and technology were being provisioned for them, according to the same people.
Using tournaments to calculate AUROC for zero-shot classification with LLMs
Yoon, Wonjin, Bulovic, Ian, Miller, Timothy A.
Large language models perform surprisingly well on many zero-shot classification tasks, but are difficult to fairly compare to supervised classifiers due to the lack of a modifiable decision boundary. In this work, we propose and evaluate a method that converts binary classification tasks into pairwise comparison tasks, obtaining relative rankings from LLMs. Repeated pairwise comparisons can be used to score instances using the Elo rating system (used in chess and other competitions), inducing a confidence ordering over instances in a dataset. We evaluate scheduling algorithms for their ability to minimize comparisons, and show that our proposed algorithm leads to improved classification performance, while also providing more information than traditional zero-shot classification.
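The Elo-based scoring described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the pairwise judge here is a hypothetical stand-in (a noisy latent score in place of an LLM comparison prompt), pairings are sampled uniformly at random rather than by the paper's scheduling algorithm, and AUROC is computed with the standard rank-sum formulation over the induced confidence ordering:

```python
import random

def elo_update(r_a, r_b, a_wins, k=32):
    """Standard Elo update: expected score for A, then symmetric rating adjustment."""
    e_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    s_a = 1.0 if a_wins else 0.0
    return r_a + k * (s_a - e_a), r_b - k * (s_a - e_a)

def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney) formulation: the fraction of
    positive/negative pairs the scores order correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical stand-in for an LLM pairwise judge: instance i "beats" j when
# its latent score is higher (label plus noise, so positives tend to win).
random.seed(0)
labels = [1, 1, 1, 0, 0, 0, 1, 0]
latent = [y + random.gauss(0, 0.3) for y in labels]
judge = lambda i, j: latent[i] > latent[j]

# Tournament: repeated random pairings update the Elo ratings, which induce
# a confidence ordering over the dataset.
ratings = [1000.0] * len(labels)
for _ in range(200):
    i, j = random.sample(range(len(labels)), 2)
    ratings[i], ratings[j] = elo_update(ratings[i], ratings[j], judge(i, j))

print(auroc(ratings, labels))
```

Because the ratings only need to rank instances, not cross a fixed decision boundary, AUROC over the Elo scores gives the threshold-free comparison to supervised classifiers that plain zero-shot label outputs cannot.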
- Research Report > Experimental Study (0.69)
- Research Report > New Finding (0.47)
- Health & Medicine (1.00)
- Leisure & Entertainment > Games > Chess (0.58)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.33)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.31)
Which Information should the UK and US AISI share with an International Network of AISIs? Opportunities, Risks, and a Tentative Proposal
The UK AI Safety Institute (UK AISI) and its parallel organisation in the United States (US AISI) occupy a unique position in the recently established International Network of AISIs. Both are in jurisdictions with frontier AI companies and are assuming leading roles in the international conversation on AI safety. This paper argues that it is in the interest of both institutions to share specific categories of information with the International Network of AISIs, to deliberately abstain from sharing others, and to carefully evaluate sharing some categories on a case-by-case basis, according to domestic priorities. The paper further proposes a provisional framework with which policymakers and researchers can distinguish between these three cases, taking into account the potential benefits and risks of sharing specific categories of information, ranging from pre-deployment evaluation results to evaluation standards. To further improve research on policy-relevant information-sharing decisions, the paper emphasises the importance of continuously monitoring the fluctuating factors that influence sharing decisions, and calls for more in-depth analysis of specific policy-relevant information categories and additional factors in future research.
Enabling External Scrutiny of AI Systems with Privacy-Enhancing Technologies
This article describes how technical infrastructure developed by the nonprofit OpenMined enables external scrutiny of AI systems without compromising sensitive information. Independent external scrutiny of AI systems provides crucial transparency into AI development, so it should be an integral component of any approach to AI governance. In practice, external researchers have struggled to gain access to AI systems because of AI companies' legitimate concerns about security, privacy, and intellectual property. But now, privacy-enhancing technologies (PETs) have reached a new level of maturity: end-to-end technical infrastructure developed by OpenMined combines several PETs into various setups that enable privacy-preserving audits of AI systems. We showcase two case studies where this infrastructure has been deployed in real-world governance scenarios: "Understanding Social Media Recommendation Algorithms with the Christchurch Call" and "Evaluating Frontier Models with the UK AI Safety Institute." We describe types of scrutiny of AI systems that could be facilitated by current setups and OpenMined's proposed future setups. We conclude that these innovative approaches deserve further exploration and support from the AI governance community. Interested policymakers can focus on empowering researchers at the legal level.
- Europe > United Kingdom (0.05)
- Oceania > New Zealand (0.04)
- North America > United States > Delaware (0.04)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government > Regional Government > North America Government > United States Government (0.47)
British AI startup with government ties is developing tech for military drones
A company that has worked closely with the UK government on artificial intelligence safety, the NHS and education is also developing AI for military drones. The consultancy Faculty AI has "experience developing and deploying AI models on to UAVs", or unmanned aerial vehicles, according to a defence industry partner company. Faculty has emerged as one of the most active companies selling AI services in the UK. Unlike OpenAI, DeepMind or Anthropic, it does not develop models itself, instead focusing on reselling models, notably from OpenAI, and consulting on their use in government and industry. Faculty gained particular prominence in the UK after working on data analysis for the Vote Leave campaign before the Brexit vote.
- Government > Regional Government > Europe Government > United Kingdom Government (1.00)
- Government > Military (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (0.91)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.58)
- Asia > China (0.16)
- North America > United States > California > San Francisco County > San Francisco (0.07)
- Europe > France (0.06)